Under the Semantic Web view of OBDA, OBDA systems take as input an OWL 2 QL ontology (TBox), a database instance over some database schema, and a mapping file that relates queries over the database to assertions in the ontology (ABox).
Benchmarks for OBDA systems should ideally provide the means to perform scalability analyses with respect to each of these input components. The NPD benchmark allows for scalability analyses with respect to the size of the data component by using VIG to scale a data instance derived from real-world data available in the NPD FactPages.
Although VIG is currently used with the NPD benchmark, it is not specific to that setting. To generate data with VIG, it suffices to provide a source data instance, stored in a MySQL RDBMS, and (optionally) a mappings file. Note that the MySQL requirement is not really a limitation, because conversion tools between RDBMSs are generally available and because the only schema information used by VIG consists of primary and foreign key constraints (i.e., triggers or stored procedures do not need to be translated). VIG produces data in the form of CSV files, which can then be imported into any RDBMS. Before including VIG in your benchmark, please check whether the generation strategy and the similarity measures it guarantees are suitable for your scenario, by referring to the wiki page [[Characteristics of The Data Produced By VIG]]. Briefly, your decision should be based on the following criteria:
- Primary Keys: VIG supports both single and multi-attribute primary keys.
- Foreign Keys: VIG supports only single-attribute foreign keys; multi-attribute foreign keys are NOT supported.
- Datatypes: The list of datatypes accepted by VIG can be found here. Please contact the authors if you need support for additional datatypes.
- Disjointness Support: VIG can generate data while satisfying disjointness constraints specified in the ontology, provided that the source data instance satisfies them and that individuals for disjoint classes are built in the mappings from values of exactly one column. For detailed information, please refer to here.
- Clustering through labels: OBDA mapping files often contain individuals that are created by partitioning a column in the database w.r.t. a fixed set of values in another column (e.g., in the NPD or LUBM benchmarks). VIG can automatically detect such situations through mapping analysis, while giving the user the ability to manually specify additional fixed-domain columns. For detailed information, please refer to here.
- Quality of the Data Produced: VIG makes some assumptions about the values in the columns; for instance, values in a column are always generated according to a uniform distribution. If your source data instance contains many columns where this assumption is significantly violated, VIG will not be able to reflect this property in the generated data.
- Performance: VIG is extremely efficient as values for different columns are generated independently and in constant time. If you need to generate large amounts of data, then VIG might be a good candidate for your job. Here you can find some results for the generation of data for the NPD Benchmark.
- Parallelization: VIG can be parallelized up to the number of columns in the database, without communication overhead. Therefore, if you have access to a cluster for generating your data, you can benefit from this feature to further speed up the generation process. Details on how to use this feature can be found here.
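
As a quick pre-check for the foreign-key criterion above, the following minimal sketch flags foreign keys that span more than one column. It is not part of VIG: the function name, the input format, and the table and constraint names are hypothetical, and it assumes you have already exported one `(table, constraint_name, column)` row per foreign-key column from your source schema (for example, from MySQL's `information_schema.KEY_COLUMN_USAGE`).

```python
from collections import defaultdict


def multi_attribute_foreign_keys(fk_rows):
    """Return {(table, constraint): [columns]} for foreign keys with >1 column.

    fk_rows is a hypothetical input format: one (table, constraint_name,
    column) tuple per foreign-key column, e.g. rows read from MySQL's
    information_schema.KEY_COLUMN_USAGE where REFERENCED_TABLE_NAME
    is not NULL.
    """
    columns_by_key = defaultdict(list)
    for table, constraint, column in fk_rows:
        columns_by_key[(table, constraint)].append(column)
    # Keep only the constraints that involve more than one column.
    return {key: cols for key, cols in columns_by_key.items() if len(cols) > 1}


# Illustrative rows only; these table and constraint names are made up.
example_rows = [
    ("wellbore", "fk_wellbore_field", "field_id"),
    ("licence_area", "fk_area_block", "block_id"),
    ("licence_area", "fk_area_block", "quadrant_id"),
]

unsupported = multi_attribute_foreign_keys(example_rows)
for (table, constraint), cols in unsupported.items():
    print(f"{table}.{constraint} spans {len(cols)} columns: {', '.join(cols)}")
```

An empty result suggests that the schema meets the single-attribute foreign-key requirement; any reported constraint would need to be reworked before the instance is handed to VIG.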